{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Sentiment analysis with pretrained word vectors"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In Chapter 15, Word Embeddings, we discussed how to learn domain-specific word embeddings. Word2vec, and related learning algorithms, produce high-quality word vectors, but require large datasets. Hence, it is common that research groups share word vectors trained on large datasets, similar to the weights for pretrained deep learning models that we encountered in the section on transfer learning in the previous chapter.\n",
    "\n",
    "We are now going to illustrate how to use pretrained Global Vectors for Word Representation (GloVe) provided by the Stanford NLP group with the IMDB review dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Using TensorFlow backend.\n"
     ]
    }
   ],
   "source": [
    "%matplotlib inline\n",
    "from pathlib import Path\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "import matplotlib.pyplot as plt\n",
    "import seaborn as sns\n",
    "from datetime import datetime, date\n",
    "from sklearn.metrics import mean_squared_error, roc_auc_score\n",
    "from sklearn.preprocessing import minmax_scale\n",
    "from keras.callbacks import ModelCheckpoint, EarlyStopping\n",
    "from keras.datasets import imdb\n",
    "from keras.models import Sequential, Model\n",
    "from keras.layers import Dense, LSTM, GRU, Input, concatenate, Embedding, Reshape\n",
    "from keras.preprocessing.sequence import pad_sequences\n",
    "from keras.preprocessing.text import Tokenizer\n",
    "import keras\n",
    "import keras.backend as K\n",
    "import tensorflow as tf"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "sns.set_style('whitegrid')\n",
    "np.random.seed(42)\n",
    "K.clear_session()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Load Reviews"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We are going to load the IMDB dataset from the source for manual preprocessing."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Data source: [Stanford IMDB Reviews Dataset](http://ai.stanford.edu/~amaas/data/sentiment/)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "path = Path('data/aclImdb/')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "50000"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "files = path.glob('**/*.txt')\n",
    "len(list(files))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "files = path.glob('*/**/*.txt')\n",
    "data = []\n",
    "for f in files:\n",
    "    _, _, data_set, outcome = str(f.parent).split('/')\n",
    "    data.append([data_set, int(outcome=='pos'), f.read_text(encoding='latin1')])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "data = pd.DataFrame(data, columns=['dataset', 'label', 'review']).sample(frac=1.0)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "train_data = data.loc[data.dataset=='train', ['label', 'review']]\n",
    "test_data = data.loc[data.dataset=='test', ['label', 'review']]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "1    12500\n",
       "0    12500\n",
       "Name: label, dtype: int64"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "train_data.label.value_counts()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "1    12500\n",
       "0    12500\n",
       "Name: label, dtype: int64"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "test_data.label.value_counts()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Prepare Data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Tokenizer"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Keras provides a tokenizer that we use to convert the text documents to integer-encoded sequences, as shown here:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [],
   "source": [
    "num_words = 10000\n",
    "t = Tokenizer(num_words=num_words, \n",
    "              lower=True, \n",
    "              oov_token=2)\n",
    "t.fit_on_texts(train_data.review)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "88586"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "vocab_size = len(t.word_index) + 1\n",
    "vocab_size"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [],
   "source": [
    "train_data_encoded = t.texts_to_sequences(train_data.review)\n",
    "test_data_encoded = t.texts_to_sequences(test_data.review)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [],
   "source": [
    "max_length = 100"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Pad Sequences"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We also use the pad_sequences function to convert the list of lists (of unequal length) to stacked sets of padded and truncated arrays for both the train and test datasets:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(25000, 100)"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "X_train_padded = pad_sequences(train_data_encoded, \n",
    "                            maxlen=max_length, \n",
    "                            padding='post',\n",
    "                           truncating='post')\n",
    "y_train = train_data['label']\n",
    "X_train_padded.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(25000, 100)"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "X_test_padded = pad_sequences(test_data_encoded, \n",
    "                            maxlen=max_length, \n",
    "                            padding='post',\n",
    "                           truncating='post')\n",
    "y_test = test_data['label']\n",
    "X_test_padded.shape"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Load Embeddings"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Assuming we have downloaded and unzipped the GloVe data to the location indicated in the code, we now create a dictionary that maps GloVe tokens to 100-dimensional real-valued vectors, as follows:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [],
   "source": [
    "# load the whole embedding into memory\n",
    "glove_path = Path('data/glove/glove.6B.100d.txt')\n",
    "embeddings_index = dict()\n",
    "\n",
    "for line in glove_path.open(encoding='latin1'):\n",
    "    values = line.split()\n",
    "    word = values[0]\n",
    "    try:\n",
    "        coefs = np.asarray(values[1:], dtype='float32')\n",
    "    except:\n",
    "        continue\n",
    "    embeddings_index[word] = coefs"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Loaded 399,883 word vectors.\n"
     ]
    }
   ],
   "source": [
    "print('Loaded {:,d} word vectors.'.format(len(embeddings_index)))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "There are around 340,000 word vectors that we use to create an embedding matrix that matches the vocabulary so that the RNN model can access embeddings by the token index:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [],
   "source": [
    "embedding_matrix = np.zeros((vocab_size, 100))\n",
    "for word, i in t.word_index.items():\n",
    "    embedding_vector = embeddings_index.get(word)\n",
    "    if embedding_vector is not None:\n",
    "        embedding_matrix[i] = embedding_vector"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(88586, 100)"
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "embedding_matrix.shape"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Define Model Architecture"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The difference between this and the RNN setup in the previous example is that we are going to pass the embedding matrix to the embedding layer and set it to non-trainable, so that the weights remain fixed during training:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [],
   "source": [
    "embedding_size = 100"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "_________________________________________________________________\n",
      "Layer (type)                 Output Shape              Param #   \n",
      "=================================================================\n",
      "embedding_1 (Embedding)      (None, 100, 100)          8858600   \n",
      "_________________________________________________________________\n",
      "gru_1 (GRU)                  (None, 32)                12768     \n",
      "_________________________________________________________________\n",
      "dense_1 (Dense)              (None, 1)                 33        \n",
      "=================================================================\n",
      "Total params: 8,871,401\n",
      "Trainable params: 12,801\n",
      "Non-trainable params: 8,858,600\n",
      "_________________________________________________________________\n"
     ]
    }
   ],
   "source": [
    "rnn = Sequential([\n",
    "    Embedding(input_dim=vocab_size, \n",
    "              output_dim= embedding_size, \n",
    "              input_length=max_length,\n",
    "              weights=[embedding_matrix], \n",
    "              trainable=False),\n",
    "    GRU(units=32,  dropout=0.2, recurrent_dropout=0.2),\n",
    "    Dense(1, activation='sigmoid')\n",
    "])\n",
    "rnn.summary()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [],
   "source": [
    "rnn.compile(loss='binary_crossentropy', \n",
    "            optimizer='RMSProp', \n",
    "            metrics=['accuracy'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [],
   "source": [
    "rnn_path = 'models/imdb.gru_pretrained.weights.best.hdf5'\n",
    "checkpointer = ModelCheckpoint(filepath=rnn_path,\n",
    "                              monitor='val_loss',\n",
    "                              save_best_only=True,\n",
    "                              save_weights_only=True,\n",
    "                              period=5)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {},
   "outputs": [],
   "source": [
    "early_stopping = EarlyStopping(monitor='val_loss', \n",
    "                              patience=5,\n",
    "                              restore_best_weights=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {
    "scrolled": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Train on 25000 samples, validate on 25000 samples\n",
      "Epoch 1/25\n",
      "25000/25000 [==============================] - 63s 3ms/step - loss: 0.6392 - acc: 0.6218 - val_loss: 0.5223 - val_acc: 0.7461\n",
      "Epoch 2/25\n",
      "25000/25000 [==============================] - 62s 2ms/step - loss: 0.5263 - acc: 0.7458 - val_loss: 0.4666 - val_acc: 0.7781\n",
      "Epoch 3/25\n",
      "25000/25000 [==============================] - 63s 3ms/step - loss: 0.4834 - acc: 0.7722 - val_loss: 0.4407 - val_acc: 0.7918\n",
      "Epoch 4/25\n",
      "25000/25000 [==============================] - 62s 2ms/step - loss: 0.4550 - acc: 0.7869 - val_loss: 0.4264 - val_acc: 0.7986\n",
      "Epoch 5/25\n",
      "25000/25000 [==============================] - 63s 3ms/step - loss: 0.4395 - acc: 0.7958 - val_loss: 0.4131 - val_acc: 0.8066\n",
      "Epoch 6/25\n",
      "25000/25000 [==============================] - 63s 3ms/step - loss: 0.4247 - acc: 0.8037 - val_loss: 0.4040 - val_acc: 0.8096\n",
      "Epoch 7/25\n",
      "25000/25000 [==============================] - 63s 3ms/step - loss: 0.4113 - acc: 0.8117 - val_loss: 0.4022 - val_acc: 0.8100\n",
      "Epoch 8/25\n",
      "25000/25000 [==============================] - 63s 3ms/step - loss: 0.4043 - acc: 0.8129 - val_loss: 0.3950 - val_acc: 0.8144\n",
      "Epoch 9/25\n",
      "25000/25000 [==============================] - 63s 3ms/step - loss: 0.3985 - acc: 0.8156 - val_loss: 0.3990 - val_acc: 0.8120\n",
      "Epoch 10/25\n",
      "25000/25000 [==============================] - 63s 3ms/step - loss: 0.3917 - acc: 0.8216 - val_loss: 0.3938 - val_acc: 0.8176\n",
      "Epoch 11/25\n",
      "25000/25000 [==============================] - 67s 3ms/step - loss: 0.3888 - acc: 0.8233 - val_loss: 0.3930 - val_acc: 0.8179\n",
      "Epoch 12/25\n",
      "25000/25000 [==============================] - 66s 3ms/step - loss: 0.3817 - acc: 0.8265 - val_loss: 0.3871 - val_acc: 0.8192\n",
      "Epoch 13/25\n",
      "25000/25000 [==============================] - 66s 3ms/step - loss: 0.3794 - acc: 0.8296 - val_loss: 0.4039 - val_acc: 0.8134\n",
      "Epoch 14/25\n",
      "25000/25000 [==============================] - 63s 3ms/step - loss: 0.3753 - acc: 0.8311 - val_loss: 0.3843 - val_acc: 0.8237\n",
      "Epoch 15/25\n",
      "25000/25000 [==============================] - 63s 3ms/step - loss: 0.3682 - acc: 0.8328 - val_loss: 0.3906 - val_acc: 0.8196\n",
      "Epoch 16/25\n",
      "25000/25000 [==============================] - 67s 3ms/step - loss: 0.3675 - acc: 0.8374 - val_loss: 0.3822 - val_acc: 0.8262\n",
      "Epoch 17/25\n",
      "25000/25000 [==============================] - 65s 3ms/step - loss: 0.3641 - acc: 0.8373 - val_loss: 0.3799 - val_acc: 0.8253\n",
      "Epoch 18/25\n",
      "25000/25000 [==============================] - 64s 3ms/step - loss: 0.3630 - acc: 0.8382 - val_loss: 0.3875 - val_acc: 0.8205\n",
      "Epoch 19/25\n",
      "25000/25000 [==============================] - 63s 3ms/step - loss: 0.3586 - acc: 0.8417 - val_loss: 0.3820 - val_acc: 0.8267\n",
      "Epoch 20/25\n",
      "25000/25000 [==============================] - 63s 3ms/step - loss: 0.3600 - acc: 0.8387 - val_loss: 0.3862 - val_acc: 0.8262\n",
      "Epoch 21/25\n",
      "25000/25000 [==============================] - 63s 3ms/step - loss: 0.3560 - acc: 0.8395 - val_loss: 0.3847 - val_acc: 0.8249\n",
      "Epoch 22/25\n",
      "25000/25000 [==============================] - 64s 3ms/step - loss: 0.3542 - acc: 0.8408 - val_loss: 0.3815 - val_acc: 0.8264\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "<keras.callbacks.History at 0x7f16815803c8>"
      ]
     },
     "execution_count": 26,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "rnn.fit(X_train_padded, \n",
    "        y_train, \n",
    "        batch_size=32, \n",
    "        epochs=25, \n",
    "        validation_data=(X_test_padded, y_test), \n",
    "        callbacks=[checkpointer, early_stopping],\n",
    "        verbose=1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {},
   "outputs": [],
   "source": [
    "rnn.load_weights(rnn_path)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.910672304"
      ]
     },
     "execution_count": 30,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "y_score = rnn.predict(X_test_padded)\n",
    "roc_auc_score(y_score=y_score.squeeze(), y_true=y_test)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.8"
  },
  "toc": {
   "base_numbering": 1,
   "nav_menu": {},
   "number_sections": true,
   "sideBar": true,
   "skip_h1_title": true,
   "title_cell": "Table of Contents",
   "title_sidebar": "Contents",
   "toc_cell": false,
   "toc_position": {},
   "toc_section_display": true,
   "toc_window_display": false
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
